Cory Melendez

Clustering Project

9/17/2020

Github Link: https://github.com/cmelende/ClusteringProject.git

In [303]:
import pandas as pd
from univariateAnalysis import UniVariateAnalysis, UniVariateReport, OutlierFilter
from scipy.stats import zscore
import seaborn as sns
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
import matplotlib.pylab as plt
import numpy as np
from sklearn.cluster import AgglomerativeClustering 
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist  
from sklearn.metrics import silhouette_samples, silhouette_score
In [304]:
df = pd.read_excel("Credit Card Customer Data.xlsx")
all_cols = ['Sl_No','Customer Key','Avg_Credit_Limit','Total_Credit_Cards','Total_visits_bank','Total_visits_online','Total_calls_made']
def print_all_uni_analysis_reports(df,columnNames):
    separator = '---------------------------------------------'
    for column in columnNames:
        analysis = UniVariateAnalysis(df, column)
        analysis_report = UniVariateReport(analysis)

        print(separator)
        print(f'\'{column}\' column univariate analysis report')
        print(separator)

        analysis_report.print_report()
In [305]:
df.head()
Out[305]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
1 2 38414 50000 3 0 10 9
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
4 5 47437 100000 6 0 12 3
In [306]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB
In [307]:
df.describe()
Out[307]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
count 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000
mean 330.500000 55141.443939 34574.242424 4.706061 2.403030 2.606061 3.583333
std 190.669872 25627.772200 37625.487804 2.167835 1.631813 2.935724 2.865317
min 1.000000 11265.000000 3000.000000 1.000000 0.000000 0.000000 0.000000
25% 165.750000 33825.250000 10000.000000 3.000000 1.000000 1.000000 1.000000
50% 330.500000 53874.500000 18000.000000 5.000000 2.000000 2.000000 3.000000
75% 495.250000 77202.500000 48000.000000 6.000000 4.000000 4.000000 5.000000
max 660.000000 99843.000000 200000.000000 10.000000 5.000000 15.000000 10.000000
In [308]:
# check to see if there are any NaN
df.isnull().values.any()
Out[308]:
False

1. Univariate Analysis and Data Cleaning

The columns 'Avg_Credit_Limit' and 'Total_visits_online' have several outliers. I'll create a dataframe for later use so we can try out clustering without outliers. There are no categorical variables in the traditional sense; we could turn the low-range columns ('Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made') into categorical variables by splitting each into n boolean columns, where n = (max + 1 - min). At this time, though, I don't see the need to do this. In a real-life scenario it would also make our model more brittle, since new rows with column values outside the current range would require additional work to maintain the model.
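The binning idea above can be sketched with pandas' get_dummies on a toy column (this is illustrative only and is not applied to the project data):

```python
import pandas as pd

# Toy stand-in for a low-range count column such as 'Total_visits_bank'.
toy = pd.DataFrame({'Total_visits_bank': [0, 2, 5, 2, 1]})

# One boolean column per observed value. A new row with an unseen value
# would add a column, which is the brittleness noted above.
dummies = pd.get_dummies(toy['Total_visits_bank'], prefix='visits_bank')
print(dummies.columns.tolist())
```

Each row ends up with exactly one True among the generated columns.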

As for scaling: we can see that the non-'total' columns are scaled quite differently from the rest. Whereas the 'total' columns range from roughly 0-20, the other columns are on a significantly larger scale. We'll keep this in mind so these differences in scale don't negatively affect our clustering.

I'm not entirely sure what 'Sl_No' is supposed to represent here, but I am assuming it is a serial-number column. Neither 'Sl_No' nor 'Customer Key' seems to represent anything tangible that we would want to cluster on; these values are mostly system generated (based on the CIS system the bank uses). In a real-world scenario, we would lean on business analysts and product people to better understand what these fields mean and whether any information about the customer can be derived from them. For now, I am going to remove these columns when grabbing the scaled dataframes.
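As an aside, the identifier columns could also be dropped by name instead of the positional iloc slicing used below, which is more robust if the column order ever changes (a sketch on a toy frame, not the code used in this notebook):

```python
import pandas as pd

df_toy = pd.DataFrame({'Sl_No': [1, 2], 'Customer Key': [10, 20],
                       'Avg_Credit_Limit': [100000, 50000]})

# Drop identifier columns explicitly by name rather than by position.
features = df_toy.drop(columns=['Sl_No', 'Customer Key'])
print(features.columns.tolist())
```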

In [309]:
print_all_uni_analysis_reports(df, all_cols)
---------------------------------------------
'Sl_No' column univariate analysis report
---------------------------------------------
Data type:  int64
Range of values: (1, 660)
Standard deviation:  190.66987176793296
Q1:  165.75
Q2:  330.5
Q3:  495.25
Q4:  660.0
Mean:  330.5
Min:  1
Median:  330.5
Max:  660
Top whisker:  989.5
Bottom whisker:  -328.5
Number of outliers above the top whisker:  0
Number of outliers below the bottom whisker:  0
---------------------------------------------
'Customer Key' column univariate analysis report
---------------------------------------------
Data type:  int64
Range of values: (11265, 99843)
Standard deviation:  25627.772200050316
Q1:  33825.25
Q2:  53874.5
Q3:  77202.5
Q4:  99843.0
Mean:  55141.44393939394
Min:  11265
Median:  53874.5
Max:  99843
Top whisker:  142268.375
Bottom whisker:  -31240.625
Number of outliers above the top whisker:  0
Number of outliers below the bottom whisker:  0
---------------------------------------------
'Avg_Credit_Limit' column univariate analysis report
---------------------------------------------
Data type:  int64
Range of values: (3000, 200000)
Standard deviation:  37625.48780422166
Q1:  10000.0
Q2:  18000.0
Q3:  48000.0
Q4:  200000.0
Mean:  34574.242424242424
Min:  3000
Median:  18000.0
Max:  200000
Top whisker:  105000.0
Bottom whisker:  -47000.0
Number of outliers above the top whisker:  39
Indices of higher outlier rows
1) 612
2) 614
3) 615
4) 617
5) 618
6) 619
7) 620
8) 621
9) 622
10) 623
11) 624
12) 626
13) 627
14) 629
15) 630
16) 631
17) 632
18) 633
19) 634
20) 635
21) 636
22) 637
23) 638
24) 639
25) 640
26) 641
27) 644
28) 645
29) 646
30) 647
31) 648
32) 649
33) 650
34) 651
35) 652
36) 654
37) 657
38) 658
39) 659
Number of outliers below the bottom whisker:  0
---------------------------------------------
'Total_Credit_Cards' column univariate analysis report
---------------------------------------------
Data type:  int64
Range of values: (1, 10)
Standard deviation:  2.167834859511195
Q1:  3.0
Q2:  5.0
Q3:  6.0
Q4:  10.0
Mean:  4.706060606060606
Min:  1
Median:  5.0
Max:  10
Top whisker:  10.5
Bottom whisker:  -1.5
Number of outliers above the top whisker:  0
Number of outliers below the bottom whisker:  0
---------------------------------------------
'Total_visits_bank' column univariate analysis report
---------------------------------------------
Data type:  int64
Range of values: (0, 5)
Standard deviation:  1.631812875791615
Q1:  1.0
Q2:  2.0
Q3:  4.0
Q4:  5.0
Mean:  2.403030303030303
Min:  0
Median:  2.0
Max:  5
Top whisker:  8.5
Bottom whisker:  -3.5
Number of outliers above the top whisker:  0
Number of outliers below the bottom whisker:  0
---------------------------------------------
'Total_visits_online' column univariate analysis report
---------------------------------------------
Data type:  int64
Range of values: (0, 15)
Standard deviation:  2.9357241204935423
Q1:  1.0
Q2:  2.0
Q3:  4.0
Q4:  15.0
Mean:  2.606060606060606
Min:  0
Median:  2.0
Max:  15
Top whisker:  8.5
Bottom whisker:  -3.5
Number of outliers above the top whisker:  37
Indices of higher outlier rows
1) 1
2) 4
3) 6
4) 612
5) 613
6) 615
7) 616
8) 617
9) 618
10) 619
11) 620
12) 621
13) 622
14) 624
15) 626
16) 627
17) 628
18) 630
19) 631
20) 633
21) 637
22) 639
23) 640
24) 641
25) 642
26) 644
27) 645
28) 647
29) 650
30) 651
31) 653
32) 654
33) 655
34) 656
35) 657
36) 658
37) 659
Number of outliers below the bottom whisker:  0
---------------------------------------------
'Total_calls_made' column univariate analysis report
---------------------------------------------
Data type:  int64
Range of values: (0, 10)
Standard deviation:  2.8653168176227113
Q1:  1.0
Q2:  3.0
Q3:  5.0
Q4:  10.0
Mean:  3.5833333333333335
Min:  0
Median:  3.0
Max:  10
Top whisker:  11.0
Bottom whisker:  -5.0
Number of outliers above the top whisker:  0
Number of outliers below the bottom whisker:  0
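For reference, the whisker values in these reports follow the standard 1.5 x IQR rule (assuming the UniVariateAnalysis helper computes them this way); the 'Avg_Credit_Limit' figures can be reproduced directly from the quartiles in df.describe():

```python
# Quartiles for 'Avg_Credit_Limit' taken from df.describe() above.
q1, q3 = 10000.0, 48000.0
iqr = q3 - q1                        # 38000.0
top_whisker = q3 + 1.5 * iqr         # matches the reported 105000.0
bottom_whisker = q1 - 1.5 * iqr      # matches the reported -47000.0
print(top_whisker, bottom_whisker)
```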
In [310]:
outlier_filter = OutlierFilter(df, all_cols)
df_no_outliers = outlier_filter.get_df_without_outliers()
In [311]:
df_no_outliers.head()
Out[311]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
5 6 58634 20000 3 0 1 8
7 8 37376 15000 3 0 1 1
In [312]:
df_no_outliers.describe()
Out[312]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
count 611.000000 611.000000 611.000000 611.000000 611.000000 611.000000 611.000000
mean 309.057283 54991.862520 26032.733224 4.391162 2.548282 1.929624 3.772504
std 176.674012 25552.363847 21054.003371 1.885889 1.603953 1.588882 2.867963
min 1.000000 11265.000000 3000.000000 1.000000 0.000000 0.000000 0.000000
25% 156.500000 33890.500000 10000.000000 3.000000 1.000000 1.000000 1.000000
50% 309.000000 53898.000000 17000.000000 4.000000 2.000000 2.000000 3.000000
75% 461.500000 76605.000000 39000.000000 6.000000 4.000000 3.000000 6.000000
max 644.000000 99596.000000 100000.000000 9.000000 5.000000 8.000000 10.000000
In [313]:
df_no_outliers.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 611 entries, 0 to 643
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                611 non-null    int64
 1   Customer Key         611 non-null    int64
 2   Avg_Credit_Limit     611 non-null    int64
 3   Total_Credit_Cards   611 non-null    int64
 4   Total_visits_bank    611 non-null    int64
 5   Total_visits_online  611 non-null    int64
 6   Total_calls_made     611 non-null    int64
dtypes: int64(7)
memory usage: 38.2 KB
In [314]:
# Get scaled dataframes with and without outliers
df_customer_numbers_removed = df.iloc[:,2:]
df_customer_numbers_removed.head()
Out[314]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 100000 2 1 1 0
1 50000 3 0 10 9
2 50000 7 1 3 4
3 30000 5 1 1 4
4 100000 6 0 12 3
In [315]:
df_customer_numbers_removed_no_outliers = df_no_outliers.iloc[:,2:]
df_customer_numbers_removed_no_outliers.head()
Out[315]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 100000 2 1 1 0
2 50000 7 1 3 4
3 30000 5 1 1 4
5 20000 3 0 1 8
7 15000 3 0 1 1
In [316]:
df_scaled = df_customer_numbers_removed.apply(zscore)
df_scaled.head()
Out[316]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1.740187 -1.249225 -0.860451 -0.547490 -1.251537
1 0.410293 -0.787585 -1.473731 2.520519 1.891859
2 0.410293 1.058973 -0.860451 0.134290 0.145528
3 -0.121665 0.135694 -0.860451 -0.547490 0.145528
4 1.740187 0.597334 -1.473731 3.202298 -0.203739
In [317]:
df_scaled_no_outliers = df_customer_numbers_removed_no_outliers.apply(zscore)
df_scaled_no_outliers.head()
Out[317]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 3.516095 -1.268962 -0.966082 -0.585560 -1.316473
2 1.139304 1.384480 -0.966082 0.674218 0.079388
3 0.188587 0.323103 -0.966082 -0.585560 0.079388
5 -0.286771 -0.738273 -1.590053 -0.585560 1.475249
7 -0.524450 -0.738273 -1.590053 -0.585560 -0.967508

2. Analysis

Using a pair plot, we'll examine each diagonal graph to get a sense of how many clusters we may need.

'Avg_Credit_Limit': We can see a bit of a difference between the data with and without outliers; the version without outliers doesn't have as long a tail. Because of this, I would assume the no-outlier dataframe we generated may be best here. At most we'll have maybe 3 clusters; there appear to be two maxima on this graph, which would suggest 3 clusters.

'Total_Credit_Cards': Using the no-outlier df, it's easier to see that we will definitely need 3 clusters; with the outlier df it's much harder to tell, and we might need up to 4.

'Total_visits_bank': We could probably get away with 3 clusters here, but it may be worth trying 4, as the last half of the no-outlier graph looks like it has 2 maxima.

'Total_visits_online': Interestingly, the no-outlier dataframe gives us significantly different (and perhaps better) results. In the outlier df it looks like we would need at most 1 cluster, but the tail on that graph is quite significant, which may skew our data. Looking at the no-outlier df, we could use at least 3 (maybe 4) clusters.

'Total_calls_made': Not a whole lot of difference between these two dataframes. We'll probably want at least 3 clusters for this graph.
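Eyeballing maxima on the KDE diagonals can be backed up programmatically; a rough sketch on synthetic bimodal data (not the project data), counting local maxima of a fitted KDE with scipy:

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
# Synthetic bimodal sample standing in for a column with two modes.
sample = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])

# Fit a KDE and count its peaks over a dense grid; the number of modes
# is a rough lower bound on the number of clusters to try.
kde = gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 500)
peaks, _ = find_peaks(kde(grid))
print(len(peaks))
```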

In [318]:
sns.pairplot(df_scaled, height=2, aspect=2, diag_kind='kde')
Out[318]:
<seaborn.axisgrid.PairGrid at 0x2df26ad8>
[figure: pair plot of df_scaled]
In [319]:
sns.pairplot(df_scaled_no_outliers, height=2, aspect=2, diag_kind='kde')
Out[319]:
<seaborn.axisgrid.PairGrid at 0x2cf562f8>
[figure: pair plot of df_scaled_no_outliers]

3. K-Means clustering

Based on the previous analysis, let's check for the optimal number of clusters between 1 and 5. Since 4 is the largest cluster count we estimated above, we won't go much higher than that.

In [320]:
number_of_clusters = [1,2,3,4,5]


def get_distortions(scaled_df, try_number_of_clusters):
    distortions = []
    for try_num in try_number_of_clusters:
        model = KMeans(n_clusters=try_num)
        model.fit(scaled_df)
        prediction = model.predict(scaled_df)
        distortions.append(sum(np.min(cdist(scaled_df, model.cluster_centers_, 'euclidean'), axis=1)) / scaled_df.shape[0])
    return distortions
In [321]:
mean_distortions = get_distortions(df_scaled, number_of_clusters)

plt.plot(number_of_clusters, mean_distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Out[321]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
[figure: elbow plot for df_scaled]
In [322]:
mean_distortions = get_distortions(df_scaled_no_outliers, number_of_clusters)

plt.plot(number_of_clusters, mean_distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Out[322]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
[figure: elbow plot for df_scaled_no_outliers]

Either way, using the dataframe with or without outliers, we get about the same result: there is an elbow at k = 3 on both dataframes.

In [323]:
group_col = 'Group'
In [324]:
k_means_model = KMeans(3)
k_means_model.fit(df_scaled_no_outliers)
prediction = k_means_model.predict(df_scaled_no_outliers)

cluster_labels = k_means_model.fit_predict(df_scaled_no_outliers)

df_scaled_no_outliers_with_group = df_scaled_no_outliers.copy()
df_scaled_no_outliers_with_group[group_col] = prediction
df_scaled_no_outliers_with_group.head(20)
Out[324]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made Group
0 3.516095 -1.268962 -0.966082 -0.585560 -1.316473 2
2 1.139304 1.384480 -0.966082 0.674218 0.079388 2
3 0.188587 0.323103 -0.966082 -0.585560 0.079388 0
5 -0.286771 -0.738273 -1.590053 -0.585560 1.475249 1
7 -0.524450 -0.738273 -1.590053 -0.585560 -0.967508 0
8 -0.999808 -1.268962 -1.590053 0.044329 -0.618542 1
9 -1.094880 -0.207585 -1.590053 -0.585560 1.126284 1
10 -0.762129 -0.207585 -1.590053 1.933996 0.428353 1
11 -0.619522 -0.738273 -1.590053 0.044329 1.126284 1
12 -0.714593 -1.799650 -0.342111 1.933996 1.824214 1
13 -0.809665 -1.799650 -0.966082 1.933996 0.777319 1
14 -0.952272 -1.268962 -0.342111 1.304107 0.777319 1
15 -0.857201 -1.268962 -1.590053 1.933996 1.126284 1
16 -0.524450 -1.268962 -0.966082 0.044329 0.079388 1
17 -0.857201 -1.268962 -1.590053 0.674218 0.079388 1
18 -0.714593 -1.268962 -0.342111 0.044329 1.126284 1
19 -0.334307 -0.207585 -0.966082 1.933996 1.475249 1
20 -0.999808 -0.738273 -0.342111 1.933996 0.428353 1
21 -0.476914 -1.268962 -1.590053 0.674218 1.126284 1
22 -0.952272 -0.207585 -0.966082 1.304107 0.777319 1
In [325]:
silhouette_avg = silhouette_score(df_scaled_no_outliers, cluster_labels)
print("The average silhouette_score for kmeans(3) :", silhouette_avg)
The average silhouette_score for kmeans(3) : 0.3761304224229804
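The silhouette score can also be scanned across candidate k values to cross-check the elbow; a sketch on synthetic blobs (not the project data), picking the k with the highest score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The best k maximizes the average silhouette score.
best_k = max(scores, key=scores.get)
print(best_k)
```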
In [326]:
df_scaled_no_outliers_cluster = df_scaled_no_outliers_with_group.groupby([group_col])
df_scaled_no_outliers_cluster.mean()
Out[326]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Group
0 -0.402185 0.578019 0.679674 -0.574557 -0.633781
1 -0.666412 -1.057647 -1.005610 1.016239 1.091545
2 1.486817 0.629650 0.413629 -0.577735 -0.596867
In [327]:
df_scaled_no_outliers_with_group.boxplot(by=group_col, layout=(2,4), figsize=(15,10))
Out[327]:
array([[<AxesSubplot:title={'center':'Avg_Credit_Limit'}, xlabel='[Group]'>,
        <AxesSubplot:title={'center':'Total_Credit_Cards'}, xlabel='[Group]'>,
        <AxesSubplot:title={'center':'Total_calls_made'}, xlabel='[Group]'>,
        <AxesSubplot:title={'center':'Total_visits_bank'}, xlabel='[Group]'>],
       [<AxesSubplot:title={'center':'Total_visits_online'}, xlabel='[Group]'>,
        <AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>]], dtype=object)
[figure: boxplots of scaled columns by Group]

Looking at the means of our clusters, there are a few things we can derive.

Looking at total visits to the bank, visits online, and calls made, there seems to be an inverse relationship among the three columns. That is, Group 0 has the highest average of bank visits, but its online visits and calls made are low, whereas Group 1 has high online visits and calls made but a low average of bank visits. Group 2 looks a lot like Group 0.

Keeping the above in mind, the difference between Group 0 and Group 2 is the average credit limit; their total numbers of credit cards are close. So there may be some kind of relationship where customers with more credit cards are more likely to visit the bank in person rather than calling or visiting the bank's website.

We can also see the data points cluster around certain average credit limits that look somewhat dependent on total credit cards; as the total number of credit cards increases there is not a large increase in average credit limit, but the trend does seem to be there.
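Since the group means above are in z-score units, they can be translated back to original units with mean + z * std; a sketch using the no-outlier 'Avg_Credit_Limit' statistics from describe() (approximate, since describe() reports the sample std while zscore uses the population std):

```python
# 'Avg_Credit_Limit' mean and std from df_no_outliers.describe() above.
col_mean, col_std = 26032.733224, 21054.003371

def unscale(z, mean, std):
    """Convert a z-scored value back to original units."""
    return mean + z * std

# Group 2's scaled Avg_Credit_Limit mean of ~1.4868, in dollars:
print(round(unscale(1.486817, col_mean, col_std)))
```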

4. Hierarchical Clustering

In [328]:
avg_clustering = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')
avg_clustering.fit(df_scaled_no_outliers)
avg_clustering_labels = avg_clustering.fit_predict(df_scaled_no_outliers)

complete_clustering = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete')
complete_clustering.fit(df_scaled_no_outliers)
complete_clustering_labels = complete_clustering.fit_predict(df_scaled_no_outliers)

ward_clustering = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage="ward")
ward_clustering.fit(df_scaled_no_outliers)
ward_clustering_labels = ward_clustering.fit_predict(df_scaled_no_outliers)
In [329]:
labels = 'labels'

df_scaled_no_outliers_with_avg_cluster_labels = df_scaled_no_outliers.copy()
df_scaled_no_outliers_with_avg_cluster_labels[labels] = avg_clustering.labels_

df_scaled_no_outliers_with_complete_cluster_labels = df_scaled_no_outliers.copy()
df_scaled_no_outliers_with_complete_cluster_labels[labels] = complete_clustering.labels_

df_scaled_no_outliers_with_ward_cluster_labels =  df_scaled_no_outliers.copy()
df_scaled_no_outliers_with_ward_cluster_labels[labels] = ward_clustering.labels_

Average method

In [330]:
avg_sil_score = silhouette_score(df_scaled_no_outliers, avg_clustering_labels)
print("The average silhouette_score for hierarchical (average) :", avg_sil_score)
The average silhouette_score for hierarchical (average) : 0.34655119403522955
In [331]:
df_scaled_no_outliers_with_avg_cluster_labels.groupby([labels]).mean()
Out[331]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
labels
0 -0.016489 -0.005086 0.005793 -0.009543 0.004774
1 3.254648 2.180512 -1.278067 3.193774 -0.793025
2 3.516095 -1.268962 -0.966082 -0.585560 -1.316473
In [332]:
Z = linkage(df_scaled_no_outliers, metric='euclidean', method='average')
c, coph_distances = cophenet(Z, pdist(df_scaled_no_outliers))

c
Out[332]:
0.8312295289172809
In [333]:
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
[figure: dendrogram, average linkage]
In [334]:
df_scaled_no_outliers_with_avg_cluster_labels.boxplot(by=labels, layout=(2,4), figsize=(15,10))
Out[334]:
array([[<AxesSubplot:title={'center':'Avg_Credit_Limit'}, xlabel='[labels]'>,
        <AxesSubplot:title={'center':'Total_Credit_Cards'}, xlabel='[labels]'>,
        <AxesSubplot:title={'center':'Total_calls_made'}, xlabel='[labels]'>,
        <AxesSubplot:title={'center':'Total_visits_bank'}, xlabel='[labels]'>],
       [<AxesSubplot:title={'center':'Total_visits_online'}, xlabel='[labels]'>,
        <AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>]], dtype=object)
[figure: boxplots by labels, average linkage]

Complete method

In [335]:
complete_sil_score = silhouette_score(df_scaled_no_outliers, complete_clustering_labels)
print("The average silhouette_score for hierarchical (complete) :", complete_sil_score)
The average silhouette_score for hierarchical (complete) : 0.4811963046129657
In [336]:
df_scaled_no_outliers_with_complete_cluster_labels.groupby([labels]).mean()
Out[336]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
labels
0 0.365096 0.594618 0.584976 -0.595325 -0.616739
1 3.254648 2.180512 -1.278067 3.193774 -0.793025
2 -0.665773 -1.056208 -1.008242 1.009024 1.082270
In [337]:
Z = linkage(df_scaled_no_outliers, metric='euclidean', method='complete')
c, coph_distances = cophenet(Z, pdist(df_scaled_no_outliers))

c
Out[337]:
0.8162425103590711
In [338]:
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold=90,  leaf_font_size=10. )
plt.tight_layout()
[figure: dendrogram, complete linkage]
In [339]:
df_scaled_no_outliers_with_complete_cluster_labels.boxplot(by=labels, layout=(2,4), figsize=(15,10))
Out[339]:
array([[<AxesSubplot:title={'center':'Avg_Credit_Limit'}, xlabel='[labels]'>,
        <AxesSubplot:title={'center':'Total_Credit_Cards'}, xlabel='[labels]'>,
        <AxesSubplot:title={'center':'Total_calls_made'}, xlabel='[labels]'>,
        <AxesSubplot:title={'center':'Total_visits_bank'}, xlabel='[labels]'>],
       [<AxesSubplot:title={'center':'Total_visits_online'}, xlabel='[labels]'>,
        <AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>]], dtype=object)
[figure: boxplots by labels, complete linkage]

Ward method

In [340]:
ward_sil_score = silhouette_score(df_scaled_no_outliers, ward_clustering_labels)
print("The average silhouette_score for hierarchical (ward) :", ward_sil_score)
In [341]:
df_scaled_no_outliers_with_ward_cluster_labels.groupby([labels]).mean()
Out[341]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
labels
0 1.182260 0.632266 0.510839 -0.571113 -0.610539
1 -0.665773 -1.056208 -1.008242 1.009024 1.082270
2 -0.642873 0.565172 0.657701 -0.581876 -0.626705
In [342]:
Z = linkage(df_scaled_no_outliers, metric='euclidean', method='ward')
c, coph_dists = cophenet(Z , pdist(df_scaled_no_outliers))

c
Out[342]:
0.7946519309058675
In [ ]:
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold=600,  leaf_font_size=10. )
plt.tight_layout()
In [ ]:
df_scaled_no_outliers_with_ward_cluster_labels.boxplot(by=labels, layout=(2,4), figsize=(15,10))

6. KMeans vs Hierarchical

KMeans and hierarchical (ward) look similar; all of the 'total' boxplots of their clusters look almost identical. The only thing that differs is the average credit limit.

Hierarchical (complete) and hierarchical (average) look similar to each other; the only difference appears to be how far the data points spread from the centroid.
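The qualitative comparison above can be made quantitative by scoring each linkage on the same data; a sketch on synthetic blobs (not the project data):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit the same 3-cluster model with each linkage and score it.
scores = {}
for method in ('average', 'complete', 'ward'):
    labels = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X)
    scores[method] = silhouette_score(X, labels)

for method, score in scores.items():
    print(f'{method}: {score:.3f}')
```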

7.

According to their boxplots, hierarchical (average) and hierarchical (complete) have smaller 'boxes', which means the range of distances from each point to the centroid is much smaller.

As stated above, KMeans and hierarchical (ward) are similar; they differ from the other two hierarchical methods in that the range of distances from each point to the centroid is larger than with hierarchical (average) and hierarchical (complete).

Additionally, the KMeans and hierarchical (ward) clusters seem to be distributed more evenly across all columns. The boxes in each column are closer in size to one another, meaning the range (or 'spread') of distances from each point to the centroid is more consistent across groups. By contrast, hierarchical (average) and hierarchical (complete) have some very large boxes (spread) in some columns while the other groups in those columns are concentrated near the centroid (low spread).
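The 'box size' reading of spread can be checked directly by measuring each point's distance to its assigned centroid; a sketch on synthetic blobs with deliberately different spreads:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic groups with increasingly large spread.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 2.0],
                  random_state=1)
model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Per-cluster spread: median distance from each point to its centroid.
medians = []
for k in range(3):
    pts = X[model.labels_ == k]
    dists = np.linalg.norm(pts - model.cluster_centers_[k], axis=1)
    medians.append(np.median(dists))
    print(f'cluster {k}: median distance {medians[-1]:.2f}')
```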